Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC. It comprises 31 speakers and 22.79 hours of audio-visual data, and it explicitly uses the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift: they alter the first and second formants of the audio. Interestingly, the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average relative CER reduction of 17.5% across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing.
On this supplementary demo page, we provide video samples from the MLD-VC dataset. These demos cover recordings from the offline setting and from four video conferencing platforms (Zoom, Lark, Tencent Meeting, and DingTalk). In addition, we present video samples under both Plain and Lombard conditions. For each subset, we include samples from both male and female speakers and from both English and Mandarin.
Introduction: In this part, we present video samples from the offline setting and from the online setting on four video conferencing platforms. In addition, we show video samples under the Plain and Lombard (80 dB) conditions. The first row shows the Plain condition. The second row shows the Lombard (80 dB) condition.
Observations: Compared with the Offline recordings, the samples captured on video conferencing platforms often exhibit audio-visual asynchrony. Moreover, some samples suffer from blurred frames and degraded audio quality due to network jitter. Nevertheless, these artifacts are representative of what can occur in real video conferencing scenarios.
[Video samples: four sample groups, each showing Offline, Zoom, Lark, Tencent Meeting, and DingTalk recordings (columns) under the Plain condition (first row) and the Lombard (80 dB) condition (second row).]
Introduction: In this part, we present video samples under the Plain and Lombard conditions. We provide video samples recorded under different noise levels (background noise at 40 dB, 60 dB, and 80 dB). We also present Plain and Lombard video samples recorded in both the offline setting and the online setting, using Zoom as an example.
Observations: As the level of background noise increases, the speaker’s Lombard effect becomes stronger, manifesting as a slower speaking rate, increased loudness, and higher frequency.
[Video samples: four sample groups, each showing the Plain, 40 dB, 60 dB, and 80 dB conditions (columns) for the Offline setting (first row) and Zoom (second row).]
To further investigate the benefit of MLD-VC for AVSR in video conferencing, we fine-tune the LiPS-AVSR [1] model on our proposed MLD-VC dataset and evaluate its performance across multiple video conferencing platforms and datasets. In particular, we report results on the Chinese-LiPS dataset [1] and on the MLD-VC test split.
| Test Dataset | Platform | Fine-tuned | CER (%) ↓ | Relative reduction (%) |
|---|---|---|---|---|
| Chinese-LiPS [1] | Tencent Meeting | × | 10.97 | * |
| Chinese-LiPS [1] | Tencent Meeting | ✓ | 9.65 | 12.0 |
| Chinese-LiPS [1] | Lark | × | 18.53 | * |
| Chinese-LiPS [1] | Lark | ✓ | 13.64 | 26.4 |
| Chinese-LiPS [1] | Zoom | × | 9.22 | * |
| Chinese-LiPS [1] | Zoom | ✓ | 7.93 | 14.0 |
| MLD-VC (Ours) | * | × | 42.37 | * |
| MLD-VC (Ours) | * | ✓ | 13.91 | 67.2 |
Models fine-tuned with MLD-VC achieve consistent improvements across all evaluated video conferencing platforms. On Chinese-LiPS [1], fine-tuning yields an average relative CER reduction of 17.5% over Tencent Meeting, Lark, and Zoom. On the in-domain MLD-VC test set, the CER drops from 42.37% to 13.91%, corresponding to a 67.2% relative reduction. These results indicate that MLD-VC not only enhances in-domain performance but also substantially boosts cross-platform generalization in VC scenarios.
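The relative reductions quoted above can be checked directly from the CER values in the table. The sketch below is only an illustration: the helper name `relative_reduction` and the dictionary layout are our own, and it assumes that the 17.5% average is the mean of the three per-platform relative reductions, which is the convention that reproduces the reported figure.

```python
# CER values (%) copied from the results table: (without fine-tuning, with fine-tuning).
cer = {
    "Tencent Meeting": (10.97, 9.65),
    "Lark": (18.53, 13.64),
    "Zoom": (9.22, 7.93),
}


def relative_reduction(before: float, after: float) -> float:
    """Relative CER reduction in percent: 100 * (before - after) / before."""
    return 100.0 * (before - after) / before


# Hypothetical helper structure; the names are illustrative, not from the paper.
per_platform = {name: relative_reduction(b, a) for name, (b, a) in cer.items()}

# Assumption: the average is taken over the per-platform relative reductions.
average = sum(per_platform.values()) / len(per_platform)

# In-domain result on the MLD-VC test split.
in_domain = relative_reduction(42.37, 13.91)

for name, value in per_platform.items():
    print(f"{name}: {value:.1f}%")
print(f"Average over platforms: {average:.1f}%")
print(f"MLD-VC test split: {in_domain:.1f}%")
```

Rounded to one decimal place, this gives 12.0%, 26.4%, and 14.0% for the three platforms, 17.5% on average, and 67.2% on the MLD-VC test split, matching the table.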
| Online data | Hyper-expression data | Tencent Meeting CER (%) ↓ | Lark CER (%) ↓ | Zoom CER (%) ↓ |
|---|---|---|---|---|
| ✓ | ✓ | 9.65 | 13.64 | 7.93 |
| × | ✓ | 10.15 | 15.52 | 10.53 |
| ✓ | × | 10.01 | 14.48 | 9.61 |
We further conduct an ablation study to disentangle the effects of online recording conditions and spontaneous hyper-expression; the table above reports the CER (%) on each platform. Removing online data leads to an average relative CER increase of 15.9% across the three video conferencing platforms, while removing hyper-expression samples results in a 10.5% relative CER increase. These findings show that both realistic transmission distortions and spontaneous human hyper-expression contribute to AVSR robustness in video conferencing. Their combination enables the model to better match real-world video conferencing conditions and supports the design rationale of MLD-VC.
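For readers less familiar with the metric reported in both tables, CER is commonly defined as the character-level edit distance between the reference transcript and the model output, divided by the reference length. The sketch below assumes this standard definition. The page does not state which scoring script or text normalization was used, so the function names `edit_distance` and `cer` and the example sentences are purely illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Made-up example: one substituted and one deleted character out of eight.
reference = "视频会议语音识别"
hypothesis = "视频会义语音识"
print(f"CER = {cer(reference, hypothesis):.1f}%")  # prints "CER = 25.0%"
```

Because the distance is normalized by the reference length, CER can exceed 100% when the hypothesis contains many insertions.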
[1] Jinghua Zhao, Yuhang Jia, Shiyao Wang, Jiaming Zhou, Hui Wang, and Yong Qin. Chinese-LiPS: A Chinese audio-visual speech recognition dataset with lip-reading and presentation slides. arXiv preprint arXiv:2504.15066, 2025.